Threaded Code: Mastering Compact & Low-Level Execution

Welcome, fellow explorers of the digital underground! In our journey through "The Forbidden Code," we delve into techniques often overlooked by mainstream education, yet fundamental to understanding how software truly runs. Today, we excavate a powerful, historically significant method for building interpreters and compilers: Threaded Code.

Forget monolithic machine code blocks or verbose script interpretation. Threaded code offers a lean, mean approach to program execution, born from necessity but offering unique advantages even today. It's a technique where your program becomes essentially a sequence of calls, executed by a surprisingly simple underlying engine.

What is Threaded Code?

At its core, threaded code is a programming technique focused on creating highly compact executable representations of programs.

Threaded Code: A programming technique where the executable form of a program consists primarily, or entirely, of a sequence of calls to subroutines (often called "words" or "primitives" in this context).

Instead of a compiler translating high-level code into a long sequence of basic machine instructions, it translates it into a sequence of addresses or tokens that represent calls to pre-compiled subroutines. A small, fast runtime loop then "walks" this sequence, executing the designated subroutine for each entry.

Think of it like a playlist for your computer's CPU, where each "song" is a small, pre-written function, and the playlist itself is just a list of pointers to these songs. A tiny DJ program reads the playlist and plays each song in order.

This approach was particularly revolutionary for:

  1. Code Density: Threaded code can be significantly smaller than traditionally compiled machine code.
  2. Compiler/Interpreter Design: It provides a clean structure for building virtual machines and the compilers that target them.

While potentially slightly slower than highly optimized native code due to the overhead of dispatching each call, its compactness can sometimes lead to performance improvements on systems with small caches, as the entire program or critical sections might fit into the CPU's high-speed memory.

Threaded code is famously associated with the Forth programming language, but has also appeared in implementations of BASIC, COBOL, early B (a precursor to C), and systems for resource-constrained environments like minicomputers and satellites.

The Historical Necessity: Battling Memory Limits

To truly appreciate threaded code, we must journey back to an era where kilobytes, not gigabytes, were the standard unit of computer memory.

Context: Early Computing Constraints In the 1960s and 70s, computers like the Data General Nova, IBM 1130, and the first microcomputers (e.g., Altair 8800) often had only 4 KB (yes, 4,096 bytes!) of RAM. This was a critical limitation that drove intense innovation in code compaction techniques.

Software engineers faced a dilemma:

  • Full Compilation: Translating source code into machine code resulted in fast programs. However, the compiler itself, the source code, and the resulting machine code all needed memory. The generated machine code could be quite verbose, especially if common operations were repeated inline. This often made the compiled executable too large to fit alongside other necessary components in limited memory. Plus, machine code is tied to a specific hardware architecture, hindering portability.
  • Full Interpretation: Reading and executing source code line by line using an interpreter saved memory because the interpreter itself could be relatively small, and the source code is generally denser than machine code. The interpreter would call pre-compiled routines. However, interpreting source code directly is much slower than executing pre-processed code.

Threaded code emerged as a compelling middle ground.

Threaded Code's Solution: It involves a compilation step, but instead of generating native machine code for every operation occurrence, it generates a sequence of references (addresses or tokens) to existing machine code subroutines. This dramatically reduces the size of the executable program sequence itself, as common operations (like adding two numbers, pushing a value onto a stack, etc.) are implemented once as a subroutine, and the main program just lists the references to these subroutines in the required order.

This addresses the memory problem: the subroutines are compiled once, and the program structure is a compact list of subroutine addresses. A small runtime "interpreter" (often called the "inner loop" or "dispatcher") reads this list and jumps to the corresponding subroutine.

Early processors sometimes required multiple machine instructions just to perform a standard subroutine call (e.g., loading the address, performing the jump, managing the return address). Threaded code avoided repeating this multi-instruction sequence throughout the program by having the small dispatcher handle the transition logic just once per subroutine execution. While modern processors made standard calls much faster and more compact (often a single instruction), the core idea of representing program flow as a dense sequence of subroutine references remained valuable for memory-constrained systems and specific architectural designs.

The Core Mechanism: Threading the Needle

The essence of threaded code is replacing repeated segments of code, particularly subroutine calls, with a simpler, more compact representation.

Consider a program sequence like: Push A, Push B, Add, Pop C

In traditional compiled code, this might involve loading addresses, setting up stack frames, calling functions, etc., for each operation.

In threaded code, this sequence is represented as a list: Address_of_Push_A_Routine, Address_of_Push_B_Routine, Address_of_Add_Routine, Address_of_Pop_C_Routine

Or, potentially even more compactly, using tokens: Token_for_Push, Operand_A, Token_for_Push, Operand_B, Token_for_Add, Token_for_Pop, Operand_C

A small piece of code, the "dispatcher" or "inner loop," is responsible for reading this list (the "thread") and executing the corresponding operation for each entry. It maintains an Instruction Pointer (IP) (distinct from the hardware's program counter) which points to the current location in the thread.

Dispatcher / Inner Loop: The minimal runtime engine in a threaded code system. It repeatedly:

  1. Fetches the next entry (address or token) from the thread using the Instruction Pointer (ip).
  2. Determines the actual code to execute based on the fetched entry.
  3. Transfers control to that code (the subroutine).
  4. Once the subroutine finishes, the dispatcher (or the subroutine itself) updates the ip to point to the next entry in the thread.
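
Example (Sketch in C): To make this cycle concrete, here is a minimal, hypothetical dispatcher in C: the thread is an array of function pointers, and ip walks it entry by entry. For simplicity this portable sketch calls each routine and returns to the loop; true threaded systems jump instead of calling, as the models below show. All names are illustrative.

#include <stdio.h>

/* Hypothetical "words": small pre-compiled routines. */
static void hello(void) { printf("hello "); }
static void world(void) { printf("world\n"); }

typedef void (*word_fn)(void);

/* The thread: a playlist of routine addresses, NULL-terminated. */
static word_fn thread[] = { hello, world, NULL };

int main(void) {
    for (word_fn *ip = thread; *ip != NULL; ip++)  /* fetch entry, advance ip */
        (*ip)();                                   /* dispatch and execute */
    return 0;
}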

Over time, variations in how the entries are represented and how the dispatch happens led to different "threading models."

Threading Models: Different Paths Through the Code

How the list of operations is structured and interpreted gives rise to various threading models, each with its own trade-offs in terms of density, speed, and complexity.

1. Direct Threading (DTC)

Direct Threading: A threading model where the entries in the thread are direct memory addresses of the machine code subroutines to be executed.

This is one of the simplest forms. The dispatcher's job is straightforward: fetch the address at the current ip, advance ip to the next entry, and jump to the fetched address; when the subroutine finishes, it jumps back to the dispatcher, which fetches the next address.

Example (Conceptual Pseudocode):

Imagine a list thread containing addresses:

thread:
  &pushA    ; Address of routine to push A
  &pushB    ; Address of routine to push B
  &add      ; Address of routine to add top two stack items
  &popC     ; Address of routine to pop stack top into C
  ...

The dispatcher loop:

top:
  routine_address = [ip]  ; Fetch address from thread using ip
  ip = ip + word_size     ; Move ip to next entry
  jump routine_address    ; Transfer control to the routine

Each routine would end by jumping back to top. However, a common optimization (shown in the Wikipedia source) is to inline the dispatch logic (ip = ip + word_size; jump [ip]) at the end of each subroutine, creating a jump chain directly from one routine's end to the next routine's start. This avoids the extra jump through the top loop.

Example with Inlined Dispatch:

thread:
  &pushA
  &pushB
  &add
  &popC
  ...

; Dispatcher starts at &thread and sets ip to &thread
; Then jumps to the routine at &thread (&pushA)

pushA_routine:
  ; Machine code to push A onto stack
  ...
  ip = ip + word_size   ; Advance ip (points to &pushB)
  jump [ip]             ; Jump to the routine at the new ip (&pushB)

pushB_routine:
  ; Machine code to push B onto stack
  ...
  ip = ip + word_size   ; Advance ip (points to &add)
  jump [ip]             ; Jump to the routine at the new ip (&add)

add_routine:
  ; Machine code to add top two stack items
  ...
  ip = ip + word_size   ; Advance ip (points to &popC)
  jump [ip]             ; Jump to the routine at the new ip (&popC)

popC_routine:
  ; Machine code to pop stack top into C
  ...
  ip = ip + word_size   ; Advance ip (points past &popC)
  jump [ip]             ; Jump to the routine at the new ip (whatever comes next)

This forms a "thread" of jumps.

Handling Parameters in DTC: Parameters can follow the address in the thread:

thread:
  &push_value ; Address of routine to push a value
  value_A     ; The value A
  &push_value ; Address of routine to push a value
  value_B     ; The value B
  &add        ; Address of add routine
  ...

The push_value routine would read the value following its address in the thread ([ip]), push it, and then advance ip past the value before jumping to [ip]. Routines without parameters (like add) just advance ip by one word size. This adds complexity to the routines/dispatcher but avoids extra memory lookups for parameters.

Pros: Relatively simple to implement, often faster than indirect threading due to fewer memory dereferences per dispatch. Cons: Less compact than indirect or token threading if parameters are embedded.
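
Example (Sketch in C): The classic way to get real direct threading in C is the GCC/Clang "labels as values" extension (&&label and goto *). Below is a hedged sketch of a tiny stack machine computing 5 + 7; each handler ends by jumping straight to the next thread entry, and push reads its operand inline from the thread, as described above. All names are illustrative, and the pointer/integer casts are sketch-level shortcuts.

#include <stdio.h>

int main(void) {
    long stack[16], *sp = stack;                 /* data stack */
    static void *thread[] = {
        &&push, (void *)5L,                      /* push 5 (inline operand) */
        &&push, (void *)7L,                      /* push 7 (inline operand) */
        &&add, &&print, &&halt
    };
    void **ip = thread;

    goto **ip++;                                 /* start the jump chain */
push:
    *sp++ = (long)*ip++;                         /* operand follows in thread */
    goto **ip++;
add:
    sp[-2] += sp[-1]; sp--;
    goto **ip++;
print:
    printf("%ld\n", sp[-1]);                     /* prints 12 */
    goto **ip++;
halt:
    return 0;
}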

2. Indirect Threading (ITC)

Indirect Threading: A threading model where the entries in the thread are pointers to memory locations (often called "code field addresses" or "indirect blocks") that in turn point to the machine code subroutines.

This adds a level of indirection compared to DTC. Why? For better code density, especially if subroutines need parameters or other metadata. The thread itself is a list of pointers to these indirect blocks. The block contains the actual machine code address and potentially operands or other information associated with that specific use of the subroutine.

Example (Conceptual Pseudocode):

thread contains addresses of indirect blocks:

thread:
  &pushA_block ; Pointer to indirect block for pushing A
  &pushB_block ; Pointer to indirect block for pushing B
  &add_block   ; Pointer to indirect block for adding
  &popC_block  ; Pointer to indirect block for popping C
  ...

; Indirect blocks:
pushA_block:
  &push_routine  ; Pointer to the actual push machine code
  value_A        ; Parameter for push (value A)

pushB_block:
  &push_routine  ; Pointer to the same push machine code
  value_B        ; Parameter for push (value B)

add_block:
  &add_routine   ; Pointer to the actual add machine code
  ; No parameters needed here

popC_block:
  &pop_routine   ; Pointer to the actual pop machine code
  dest_C         ; Parameter for pop (destination C)

The dispatcher (using ip for the thread pointer and O for the current object/block pointer, as in the RPL example):

; Dispatcher starts with ip pointing to &thread

next_instruction:
  O = [ip]            ; Fetch the address of the indirect block (e.g., &pushA_block)
  ip = ip + word_size ; Advance ip to the next entry in the thread (&pushB_block)
  target_address = [O] ; Fetch the actual routine address from the block (e.g., &push_routine)
  jump target_address ; Jump to the routine

The routine (push_routine, add_routine, etc.) would then access its parameters (if any) from the memory location after the target address in its block (e.g., O + word_size). After executing, the routine would jump back to next_instruction (or the inlined version of it) to process the next entry pointed to by the now-advanced ip.

This is the model famously used by Charles H. Moore in his original Forth implementations, leveraging hardware indirection features on the Data General Nova.

Pros: More compact than DTC, especially when parameters are involved, as the parameters are stored once per block definition rather than repeatedly in the main thread. Cons: Slower than DTC or Subroutine Threading due to the extra memory dereference per dispatch ([O] in the example).
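
Example (Sketch in C): A hedged C approximation of ITC, with the two levels of indirection spelled out: the thread holds pointers to blocks, and each block holds the routine address plus that use's parameter. The same push_routine is shared by both push blocks. Names are illustrative.

#include <stdio.h>

typedef struct block {
    void (*code)(const struct block *b, long **sp); /* actual routine address */
    long param;                                     /* per-use parameter */
} block;

static void push_routine(const block *b, long **sp)  { *(*sp)++ = b->param; }
static void add_routine(const block *b, long **sp)   { (void)b; (*sp)[-2] += (*sp)[-1]; (*sp)--; }
static void print_routine(const block *b, long **sp) { (void)b; printf("%ld\n", (*sp)[-1]); }

/* Indirect blocks: one routine, many parameterized uses. */
static const block pushA = { push_routine, 5 };
static const block pushB = { push_routine, 7 };
static const block add_b = { add_routine, 0 };
static const block prn_b = { print_routine, 0 };

static const block *thread[] = { &pushA, &pushB, &add_b, &prn_b, NULL };

int main(void) {
    long stack[16], *sp = stack;
    for (const block **ip = thread; *ip != NULL; ip++) {
        const block *o = *ip;     /* first indirection:  O = [ip]  */
        o->code(o, &sp);          /* second indirection: jump [O]  */
    }
    return 0;                     /* prints 12 */
}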

3. Subroutine Threading (Call Threading)

Subroutine Threading: A threading model where the thread consists of a sequence of standard machine-level CALL instructions, where each CALL targets a subroutine.

This is arguably the most "conventional" form from a modern perspective, as it relies directly on the processor's built-in subroutine call mechanism.

Example (Conceptual Assembly/Pseudocode):

thread:
  CALL pushA_routine ; Machine instruction CALL
  CALL pushB_routine
  CALL add_routine
  CALL popC_routine
  ...

pushA_routine:
  ; Machine code to push A
  RET ; Machine instruction RETURN

pushB_routine:
  ; Machine code to push B
  RET

add_routine:
  ; Machine code to add
  RET

popC_routine:
  ; Machine code to pop
  RET

The main program simply consists of this sequence of CALL instructions. The CPU's hardware handles pushing the return address onto the hardware return stack and jumping to the subroutine. The RET instruction pops the return address and jumps back, effectively transferring control to the instruction after the CALL in the thread.

Pros: Leverages highly optimized hardware CALL/RET instructions, often making it faster than indirect threading and, per Ertl's benchmarks cited in the source, sometimes faster than direct threading on particular CPU architectures. Simple structure from a compiler perspective. Cons: Less compact than direct or indirect threading because each entry in the thread includes the full CALL opcode(s) in addition to the address. Still less compact than token threading.
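
Example (Sketch in C): Seen through C, subroutine-threaded output is nothing more than straight-line code made of calls; the CPU's CALL/RET machinery is the dispatcher. A hedged sketch with hypothetical stack-machine words:

#include <stdio.h>

static long stack[16], *sp = stack;

static void push5(void) { *sp++ = 5; }               /* each ends in RET */
static void push7(void) { *sp++ = 7; }
static void add(void)   { sp[-2] += sp[-1]; sp--; }
static void print(void) { printf("%ld\n", sp[-1]); }

int main(void) {      /* the "thread": CALL push5; CALL push7; CALL add; ... */
    push5();
    push7();
    add();
    print();          /* prints 12 */
    return 0;
}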

4. Token Threading (Bytecode)

Token Threading: A threading model where the thread is a sequence of small integers (tokens or opcodes), each serving as an index or key into a table of operations (which are typically subroutine addresses or machine code).

This is the champion of code density among the common methods. The tokens are typically small (e.g., 1 byte), making the thread significantly smaller than lists of full memory addresses.

Example (Conceptual):

Define a table mapping tokens to addresses:

Operation_Table:
  0x01: &push_routine
  0x02: &add_routine
  0x03: &pop_routine
  ...

thread contains tokens and sometimes parameters:

thread:
  0x01 ; Token for push
  value_A ; Parameter A
  0x01 ; Token for push
  value_B ; Parameter B
  0x02 ; Token for add
  0x03 ; Token for pop
  dest_C ; Parameter C
  ...

The dispatcher (using ip for the thread pointer):

next_instruction:
  token = [ip]          ; Fetch the token (e.g., 0x01)
  ip = ip + token_size  ; Advance ip past the token

  ; Decode the token and dispatch
  routine_address = Operation_Table[token] ; Look up address
  ; Depending on the token, fetch parameters from [ip] if needed
  ; and advance ip past them.

  jump routine_address  ; Jump to the routine

This dispatch loop is slightly more complex as it needs logic to fetch parameters based on the token type. A common implementation uses a switch statement or a branch table indexed by the token for fast dispatching.

Bytecode: A very common form of token threading, typically using 8-bit opcodes and designed for a stack-based virtual machine architecture. Examples include Java bytecode, .NET Common Intermediate Language (CIL), and Python's bytecode.

Dispatcher for simple Bytecode:

next_instruction:
  opcode = [ip]         ; Fetch 1-byte opcode
  ip = ip + 1           ; Advance ip

  jump Operation_Dispatch_Table[opcode] ; Use opcode as index to jump

Operation_Dispatch_Table:
  ; ... addresses of opcode handlers ...

Each opcode handler (like push_handler, add_handler) performs its action, potentially reads parameters from ip if needed, advances ip accordingly, and then jumps back to next_instruction.

Pros: Highest code density among common methods. Forms the basis for widely used virtual machines (Java, .NET). Can sometimes exhibit better cache performance than native code for small programs. Cons: Highest overhead per dispatch for simple operations compared to direct or subroutine threading due to the decoding and table lookup step. Can consume both instruction cache (for handlers) and data cache (for the thread/tables).
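
Example (Sketch in C): A toy token-threaded (bytecode) VM with switch dispatch. Opcodes are one byte, and OP_PUSH takes a one-byte inline operand; the encoding is invented purely for illustration.

#include <stdio.h>

enum { OP_HALT = 0x00, OP_PUSH = 0x01, OP_ADD = 0x02, OP_PRINT = 0x03 };

int main(void) {
    const unsigned char thread[] = {
        OP_PUSH, 5, OP_PUSH, 7, OP_ADD, OP_PRINT, OP_HALT
    };
    long stack[16], *sp = stack;
    const unsigned char *ip = thread;

    for (;;) {
        unsigned char op = *ip++;              /* fetch token, advance ip */
        switch (op) {                          /* decode and dispatch */
        case OP_PUSH:  *sp++ = *ip++; break;   /* inline operand follows */
        case OP_ADD:   sp[-2] += sp[-1]; sp--; break;
        case OP_PRINT: printf("%ld\n", sp[-1]); break;  /* prints 12 */
        case OP_HALT:  return 0;
        }
    }
}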

5. Huffman Threading

Huffman Threading: A highly optimized form of token threading where tokens are represented by variable-length Huffman codes, assigned based on the frequency of subroutine usage in the program.

This takes density to an extreme. By assigning shorter bit sequences to frequently used operations and longer sequences to rare ones, the total size of the thread can be minimized.

Example (Conceptual):

Suppose add is very frequent, multiply is less frequent. Instead of fixed tokens (e.g., 0x02 for add, 0x04 for multiply), Huffman codes might be:

add:      1    (1 bit)
multiply: 010  (3 bits)
push:     001  (3 bits)

The thread is now a string of bits. The dispatcher becomes more complex, having to read bits from the thread and use them to traverse a tree or table to decode the token and find the corresponding routine address.

Pros: Among the most compact program representations known, short of general-purpose compression. Ideal for extremely memory-constrained environments. Cons: Significant overhead in the dispatcher to read and decode the variable-length bit sequences. Complex to implement.
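
Example (Sketch in C): The hard part of Huffman threading is the bit-level decode. This hedged sketch walks a tiny decode tree for the three codes above (add = 1, multiply = 010, push = 001) and prints the decoded operations; a real dispatcher would jump to the corresponding routine instead. Bits are stored one per byte purely for clarity.

#include <stdio.h>

typedef struct node {
    const struct node *child[2];  /* follow bit 0 or bit 1 */
    int op;                       /* >= 0 marks a leaf */
} node;

static const node add_leaf  = { { NULL, NULL }, 0 };          /* "1"   */
static const node mul_leaf  = { { NULL, NULL }, 1 };          /* "010" */
static const node push_leaf = { { NULL, NULL }, 2 };          /* "001" */
static const node after01   = { { &mul_leaf, NULL }, -1 };    /* "01?" */
static const node after00   = { { NULL, &push_leaf }, -1 };   /* "00?" */
static const node after0    = { { &after00, &after01 }, -1 }; /* "0?"  */
static const node root      = { { &after0, &add_leaf }, -1 };

int main(void) {
    const char *names[] = { "add", "multiply", "push" };
    const unsigned char bits[] = { 0,0,1, 0,0,1, 1 };  /* push push add */
    const node *n = &root;
    for (size_t i = 0; i < sizeof bits; i++) {
        n = n->child[bits[i]];
        if (n->op >= 0) {                /* leaf reached: dispatch here */
            printf("%s\n", names[n->op]);
            n = &root;                   /* restart for the next token */
        }
    }
    return 0;
}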

6. RPL (HP Calculators)

RPL (Reverse Polish Lisp): A proprietary hybrid (direct-threaded and indirect-threaded) interpretive language used in HP calculators. It features a sophisticated, patented inner loop capable of executing both standard thread entries and embedded "objects."

RPL is a unique example of a highly refined threaded system. Its "runstream" (the thread) can contain not just addresses or tokens, but also data "objects." The dispatcher loop needs to differentiate between these. The core dispatch logic involves a couple of levels of indirection and a check to see if the current entry is a standard instruction pointer or an embedded object.

The Patented RPL Inner Loop (Simplified Conceptual Trace):

Let I be the Interpreter Pointer (points into the thread), O be the Current Object/Instruction Pointer, Δ be the size of a memory address, and PC be the hardware Program Counter.

  1. Fetch the next entry from the thread: O = [I]
  2. Advance the thread pointer: I = I + Δ
  3. Fetch the address from the fetched entry (usually the second level of indirection, typical for instructions): target_address = [O]
  4. Jump to the target address: PC = target_address + Δ (The + Δ might be for skipping a prolog or header in the target code).

The code at target_address (the "prolog") determines how to proceed. If it's a standard instruction prolog, it might execute the instruction and jump back to step 1. If O actually pointed to an embedded object instead of an indirect block, the prolog logic would detect this (e.g., by comparing PC to O + Δ) and adjust I to skip over the embedded object's data before returning to step 1, allowing the object's data to be processed differently.

Pros: Highly flexible due to the ability to embed data/objects directly in the execution stream. Optimized for specific HP hardware. Cons: Complex dispatch mechanism. Proprietary.

7. Lesser-Used Threading (String Threading)

String Threading: A historical or experimental threading model where operations are identified by their string names, often looked up using a hash table.

This is inefficient for general execution but was used in very early systems (like initial Forth versions) and experimental hardware where flexibility was prioritized over raw speed or density. Looking up a string for every single operation is slow compared to indexing a table with an integer token or using direct/indirect pointers.

Pros: Very flexible, easy to implement initially. Cons: Very slow due to string lookups. Poor density.
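
Example (Sketch in C): A hedged sketch of string threading: the thread is a list of word names, and the dispatcher looks each one up (here by linear strcmp; early systems used hash tables). Every dispatch pays a lookup cost, which is exactly why the model is slow. Names are illustrative.

#include <stdio.h>
#include <string.h>

static long stack[16], *sp = stack;
static void push5(void) { *sp++ = 5; }
static void dup_w(void) { *sp = sp[-1]; sp++; }
static void add(void)   { sp[-2] += sp[-1]; sp--; }
static void print(void) { printf("%ld\n", sp[-1]); }

/* The "dictionary": name-to-routine table searched on every dispatch. */
static const struct { const char *name; void (*fn)(void); } dict[] = {
    { "push5", push5 }, { "dup", dup_w }, { "add", add }, { "print", print },
};

int main(void) {
    const char *thread[] = { "push5", "dup", "add", "print", NULL };
    for (const char **ip = thread; *ip != NULL; ip++)        /* walk thread */
        for (size_t i = 0; i < sizeof dict / sizeof dict[0]; i++)
            if (strcmp(dict[i].name, *ip) == 0) { dict[i].fn(); break; }
    return 0;                                                /* prints 10 */
}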

Implementing Control Flow: Branches

In threaded code systems, control flow (like if statements, loops, and jumps) is managed by simply manipulating the Instruction Pointer (ip) that walks the thread.

A conditional branch, for example, involves a specific subroutine that checks a condition (usually on a separate data stack) and, based on the result, either advances the ip normally (to the next instruction after the branch) or updates the ip to a specified target address within the thread.

Example: Conditional Branch (using embedded parameter, like DTC)

Let's say we have a branch-if-zero routine (&branch_if_zero) that expects the jump target address immediately following it in the thread.

thread:
  ...
  &push_value
  0         ; Push the value 0
  &branch_if_zero ; Address of the branch routine
  &thread[123] ; The target address to jump to if zero
  &instruction_after_branch_not_taken ; This instruction is executed if not zero
  ...
thread[123]:
  &instruction_if_zero_taken ; This instruction is executed if zero
  ...

The branch_if_zero_routine:

branch_if_zero_routine:
  condition_value = pop_from_data_stack() ; Get value to check
  target_address = [ip]                   ; Read the jump target from the thread
  ip = ip + word_size                     ; Advance ip past the target address (pointing to the *next* instruction in the thread)

  if condition_value == 0:
    ip = target_address                 ; If zero, set ip to the jump target
    ; Note: The standard dispatch loop (jump [ip]) will now jump to instruction_if_zero_taken
  else:
    ; ip is already pointing to instruction_after_branch_not_taken
    ; The standard dispatch loop (jump [ip]) will execute the next instruction in sequence.

  ; Now, jump back to the dispatcher's main loop (or its inlined version)
  ; to fetch and execute the instruction pointed to by the updated ip.
  jump [ip]

By simply changing the ip, the flow of execution through the threaded sequence is altered.
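
Example (Sketch in C): The same branch-if-zero idea in the computed-goto style of the direct threading sketch above (GCC/Clang extension, illustrative names). The branch word reads its jump target from the thread, then either takes it or falls through, simply by assigning ip.

#include <stdio.h>

int main(void) {
    long stack[8], *sp = stack;
    static void *thread[] = {
        &&push, (void *)0L,      /* [0] push 0                    */
        &&brz,  &thread[6],      /* [2] if top == 0, jump to [6]  */
        &&say_no, &&halt,        /* [4] fall-through path         */
        &&say_yes, &&halt,       /* [6] branch-taken path         */
    };
    void **ip = thread;

    goto **ip++;
push:
    *sp++ = (long)*ip++;                  /* inline operand follows */
    goto **ip++;
brz: {
    void **target = (void **)*ip++;       /* read target, advance ip */
    if (*--sp == 0) ip = target;          /* take or skip the branch */
    goto **ip++;
}
say_yes: printf("it was zero\n");    goto **ip++;
say_no:  printf("it was nonzero\n"); goto **ip++;
halt:
    return 0;
}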

Common Architecture and Primitives

Threaded code systems, particularly those based on stack-oriented languages like Forth, share common architectural features to efficiently support the execution model.

Dual Stacks: A common amenity in threaded code systems is the use of two distinct stacks:

  1. Data Stack (Parameter Stack): Used for passing arguments between subroutines and performing calculations. Operations (like add, push) manipulate values on this stack.
  2. Return Stack: Used by the threading mechanism to store return addresses when a subroutine itself calls other subroutines (i.e., nesting). This keeps data manipulation separate from control flow management.

Key Registers: Threaded virtual machines often utilize specific registers (even if they are just dedicated memory locations in software) to manage the execution state:

  • ip (Instruction Pointer): Points to the current entry in the thread sequence being executed.
  • rp (Return Stack Pointer): Points to the top of the Return Stack.
  • sp (Parameter Stack Pointer): Points to the top of the Data/Parameter Stack.
  • w (Work Pointer): Often used by the dispatcher or current subroutine to point to the current entry's definition or parameters (e.g., the O pointer in the ITC or RPL examples).

Core Primitives (The VM's Heartbeat): In many indirect-threaded systems, the entire complexity of the virtual machine can be distilled into just three fundamental operations:

  • next: The core dispatch loop. Fetches the next item from the thread ([ip]), updates ip, determines the target address ([[ip]] for ITC, using w), and jumps to it. This is the heartbeat that drives the thread.
  • nest (or docol): Used when a threaded "word" (subroutine) needs to execute another sequence of threaded words (a colon definition in Forth). It pushes the current ip onto the Return Stack, sets ip to the start of the called word's body (found via w), and jumps to next.
  • unnest (or ;s): Used at the end of a threaded definition. It pops the saved ip from the Return Stack back into the ip register and jumps to next. This returns control to the calling thread sequence.

These simple primitives, combined with a clean stack architecture and dedicated registers, form a surprisingly powerful basis for building complex language implementations.
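
Example (Sketch in C): A miniature indirect-threaded inner interpreter in the Forth spirit, showing next (the run loop), nest (docol), and unnest (do_exit) with a separate return stack. It runs a colon definition equivalent to : SUM 5 7 + ; and prints the result. Everything here is a simplified, hypothetical model, not any real Forth's layout.

#include <stdio.h>

typedef struct word word;
struct word {
    void (*code)(const word *self);  /* code field: machine code to run */
    const word *const *body;         /* thread of a colon definition */
    long param;                      /* inline literal for lit words */
};

static long dstack[16], *sp = dstack;    /* data (parameter) stack */
static const word *const *ip;            /* instruction pointer */
static const word *const *rstack[16];    /* return stack */
static int rp;

static void do_lit(const word *w)   { *sp++ = w->param; }
static void do_add(const word *w)   { (void)w; sp[-2] += sp[-1]; sp--; }
static void do_print(const word *w) { (void)w; printf("%ld\n", sp[-1]); }
static void do_halt(const word *w)  { (void)w; ip = NULL; }
static void docol(const word *w) {       /* nest */
    rstack[rp++] = ip;                   /* save caller's position */
    ip = w->body;                        /* enter the callee's thread */
}
static void do_exit(const word *w) {     /* unnest */
    (void)w;
    ip = rstack[--rp];                   /* resume the caller's thread */
}

static const word lit5   = { do_lit, NULL, 5 };
static const word lit7   = { do_lit, NULL, 7 };
static const word add_w  = { do_add, NULL, 0 };
static const word prn_w  = { do_print, NULL, 0 };
static const word exit_w = { do_exit, NULL, 0 };
static const word halt_w = { do_halt, NULL, 0 };

/* : SUM 5 7 + ;   a colon definition threading other words */
static const word *const sum_body[] = { &lit5, &lit7, &add_w, &exit_w };
static const word sum_w = { docol, sum_body, 0 };

static const word *const main_thread[] = { &sum_w, &prn_w, &halt_w };

int main(void) {
    ip = main_thread;
    while (ip) {                 /* next: fetch, advance, dispatch */
        const word *w = *ip++;   /* w is the classic work pointer */
        w->code(w);
    }
    return 0;                    /* prints 12 */
}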

Why Learn About Threaded Code Today?

While you might not write performance-critical applications using pure software-interpreted string threading, understanding threaded code offers valuable insights into:

  1. Virtual Machine Design: It's a foundational technique for building interpreters and VMs (like those for Java, .NET, Python). Understanding threading clarifies how bytecode is dispatched and executed.
  2. Low-Level Execution: It exposes a layer of abstraction between high-level code and raw hardware, showing how program flow can be managed through data structures (the thread) rather than just linear machine code.
  3. Code Compaction: The principles behind token and Huffman threading are still relevant in resource-constrained environments or for highly compressed data formats.
  4. Historical Context: It provides context for the design choices in early languages and operating systems, and the ingenuity required to work within severe hardware limitations.
  5. Underground Connections: Concepts like Return-Oriented Programming (ROP), a technique used to exploit security vulnerabilities, effectively repurpose existing code sequences ending in returns (RET instructions) to build malicious execution "threads" – a dark echo of subroutine threading. Continuation-Passing Style (CPS) in functional programming is another related concept, managing control flow explicitly, though at a higher level.

Threaded code, though less visible in daily programming than traditional compilation or high-level scripting, remains a testament to clever low-level design and a crucial piece of understanding the diverse ways software can be structured and executed. It's a powerful pattern for achieving density and flexibility, a true "forbidden code" technique worth adding to your arsenal of programming knowledge.
